Class-incremental learning (CIL) has attracted much attention, but most existing related works focus on fine-tuning the whole representation model, which inevitably leads to severe catastrophic forgetting. In contrast, with a semantically rich pre-trained representation model, parameter-additional tuning (PAT) changes only a few parameters to learn new visual concepts. Recent studies have shown that PAT-based CIL can naturally avoid fighting forgetting through replay or distillation as most existing methods do. However, we find that PAT-based CIL still suffers from serious semantic drift, a problem caused by classifier learning bias across different learning phases, which significantly degrades the performance of PAT-based CIL. To address this problem, we propose incremental prototype tuning (IPT), a simple but effective method that tunes category prototypes for classification and learns example prototypes to compensate for semantic drift. Extensive experiments demonstrate that our method can effectively compensate for semantic drift. Combined with a well pre-trained ViT backbone and other PAT methods, IPT surpasses state-of-the-art baselines on mainstream incremental learning benchmarks.
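The abstract does not give implementation details; as a rough illustration of prototype-based classification with drift compensation (the class names, the cosine-similarity classifier, and the mean-shift compensation rule below are our assumptions, not the paper's method), a minimal sketch might look like:

```python
import torch
import torch.nn.functional as F

class PrototypeClassifier(torch.nn.Module):
    """Hypothetical nearest-prototype classifier on top of a frozen
    pre-trained encoder: one learnable prototype per class; logits are
    scaled cosine similarities between features and prototypes."""
    def __init__(self, feat_dim: int, num_classes: int, scale: float = 10.0):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return self.scale * feats @ protos.t()

def compensate_drift(old_protos, old_example_feats, new_example_feats):
    """Shift old class prototypes by the mean feature drift measured on
    stored example prototypes (an assumed compensation rule)."""
    drift = (new_example_feats - old_example_feats).mean(dim=0)
    return old_protos + drift
```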
Stance detection models may tend to rely on dataset bias in the text part as a shortcut and thus fail to sufficiently learn the interaction between the targets and texts. Recent debiasing methods usually treat features learned by small models, or by big models at earlier training steps, as bias features, and propose to exclude the branch learning those bias features during inference. However, most of these methods fail to disentangle the ``good'' stance features from the ``bad'' bias features in the text part. In this paper, we investigate how to mitigate dataset bias in stance detection. Motivated by causal effects, we leverage a novel counterfactual inference framework, which enables us to capture the dataset bias in the text part as the direct causal effect of the text on stances, and to reduce that bias by subtracting the direct text effect from the total causal effect. We model bias features as features that correlate with the stance labels but fail on intermediate stance reasoning subtasks, and propose an adversarial bias learning module to model the bias more accurately. To verify whether our model could better capture the interaction between texts and targets, we test it on recently proposed test sets that evaluate understanding of the task from various aspects. Experiments demonstrate that our proposed method (1) better models the bias features, and (2) outperforms existing debiasing baselines on both the original dataset and most of the newly constructed test sets.
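As a hedged illustration of the subtraction step (the fusion function and the trade-off coefficient are assumptions, not the paper's exact formulation), counterfactual debiasing of this kind is often implemented by removing a text-only branch's logits from the fused logits at inference:

```python
import torch

def debiased_logits(fused_logits: torch.Tensor,
                    text_only_logits: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Total causal effect minus the direct text effect (assumed form).

    fused_logits:     predictions from the full text+target model
    text_only_logits: predictions from a branch that sees only the text
    alpha:            assumed trade-off coefficient
    """
    return fused_logits - alpha * text_only_logits
```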
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding an emotional effect to the edited speech during text-based speech editing to make the generated speech more expressive. To achieve this, we propose Emo-CampNet (emotion CampNet), which offers a choice of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes into the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator, optimized within a generative adversarial framework, is proposed to remove the emotion from the original speech. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech during text-based speech editing and can edit unseen speakers' speech, and 2) detailed ablation experiments further prove the effectiveness of the emotion selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
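The architecture is not specified beyond this summary; as a purely hypothetical sketch of the conditioning idea (class name, shapes, and the additive-injection choice are all assumptions), a discrete emotion attribute could be mapped to an embedding and added to the masked-region context features before mask prediction:

```python
import torch

class EmotionConditioner(torch.nn.Module):
    """Assumed sketch: map a discrete emotion id to an embedding and add it
    to every frame of the context features before mask prediction."""
    def __init__(self, num_emotions: int, feat_dim: int):
        super().__init__()
        self.emotion_emb = torch.nn.Embedding(num_emotions, feat_dim)

    def forward(self, context_feats: torch.Tensor, emotion_id: torch.Tensor):
        # context_feats: (batch, frames, feat_dim); emotion_id: (batch,)
        emo = self.emotion_emb(emotion_id).unsqueeze(1)  # (batch, 1, feat_dim)
        return context_feats + emo
```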
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their success relies heavily on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, and thus do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments to varying degrees. Second, in the noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantically consistent synthetic captions that complete noisy ones, using the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show that our NLIP, using only 26M data, achieves significant performance improvements over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
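NLIP's exact noise estimator is not given here; as a hedged sketch (the GMM-on-loss heuristic below is a common memorization-effect stand-in, not necessarily the paper's estimator), per-pair noise probabilities can be used to down-weight the contrastive loss:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def noise_probabilities(per_pair_losses: np.ndarray) -> np.ndarray:
    """Fit a 2-component GMM to per-pair losses; the component with the
    larger mean is treated as the 'noisy' mode (assumed heuristic: noisy
    pairs are memorized late and keep higher loss)."""
    losses = per_pair_losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    noisy = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict_proba(losses)[:, noisy]

def noise_adaptive_loss(per_pair_losses: np.ndarray) -> float:
    """Harmonize alignments to varying degrees: weight each pair by its
    estimated cleanliness (assumed weighting scheme)."""
    w = 1.0 - noise_probabilities(per_pair_losses)
    return float((w * per_pair_losses).mean())
```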
Text summarization is recognised as one of the NLP downstream tasks and has been extensively investigated in recent years. It can help people quickly grasp information from the Internet, including news articles, social posts, videos, etc. Most existing research works attempt to develop summarization models to produce better output. However, inherent limitations of most existing models have emerged, including unfaithfulness and factual errors. In this paper, we propose a novel model, named Knowledge-aware Abstractive Text Summarization, which leverages the advantages offered by Knowledge Graphs to enhance the standard Seq2Seq model. On top of that, the Knowledge Graph triplets are extracted from the source text and utilised to provide keywords with relational information, producing coherent and factually error-free summaries. We conduct extensive experiments using real-world data sets. The results reveal that the proposed framework can effectively utilise the information from the Knowledge Graph and significantly reduce the factual errors in the summary.
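How the triplets enter the Seq2Seq model is not detailed in this summary; one hypothetical realization (the triplet source, the special tokens, and the prefix scheme are assumptions, e.g., triplets from an OpenIE-style extractor) is to linearise them and prepend them to the source text:

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

def linearise(triplets: List[Triplet]) -> str:
    """Flatten KG triplets into a textual prefix the encoder can read."""
    return " ".join(f"<t> {s} <r> {r} <o> {o}" for s, r, o in triplets)

def build_model_input(source_text: str, triplets: List[Triplet]) -> str:
    """Assumed scheme: knowledge prefix + separator + original article."""
    return linearise(triplets) + " <sep> " + source_text

# Usage with hypothetical triplets extracted from the source:
doc = "Marie Curie won the Nobel Prize in Physics in 1903."
triplets = [("Marie Curie", "won", "Nobel Prize in Physics")]
print(build_model_input(doc, triplets))
```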
Dunhuang murals combine Chinese style and ethnic styles, forming a self-contained school of Chinese-style Buddhist art. They have very high historical and cultural value and research significance. In particular, the lines of Dunhuang murals are highly general and expressive, reflecting the characters' distinctive personalities and complex inner emotions. Therefore, the outline drawing of murals is of great significance to the research of Dunhuang culture. The contour generation of Dunhuang murals belongs to image edge detection, an important branch of computer vision that aims to extract salient contour information from images. Although convolution-based deep learning networks have achieved good results in image edge extraction by exploring the contextual and semantic features of images, some local detail information is lost as the receptive field enlarges, which makes it impossible for them to generate reasonable outline drawings of murals. In this paper, we propose a novel edge detector based on self-attention combined with convolution to generate line drawings of Dunhuang murals. Compared with existing edge detection methods, firstly, a new residual self-attention and convolution mixed module (Ramix) is proposed to fuse local and global features in feature maps. Secondly, a novel densely connected backbone extraction network is designed to efficiently propagate rich edge feature information from shallow layers into deep layers. Experiments on different public datasets show that our method is able to generate sharper and richer edge maps than existing methods. In addition, testing on the Dunhuang mural dataset shows that our method achieves very competitive performance.
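The paper's exact Ramix design is not reproduced here; a minimal hypothetical sketch of a residual self-attention + convolution mix (the layer sizes, the fusion-by-addition choice, and the normalization placement are assumptions) could be:

```python
import torch
import torch.nn as nn

class Ramix(nn.Module):
    """Assumed sketch of a residual self-attention and convolution mixed
    block: a conv branch captures local detail, an attention branch captures
    global context, and the two are fused by addition with a residual path.
    `channels` must be divisible by `num_heads`."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = x.flatten(2).transpose(1, 2)          # (b, h*w, c)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = self.norm(glob).transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob                        # residual fusion
```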
Previous databases have been designed to further the development of fake audio detection. However, fake utterances are mostly generated by altering the timbre, prosody, linguistic content or channel noise of original audio. These databases ignore a fake situation in which the attacker manipulates the acoustic scene of the original audio, replacing it with a forged one. It would pose a major threat to our society if someone misused such manipulated audio for malicious purposes. Therefore, this motivates us to fill in the gap. This paper designs such a dataset for scene fake audio detection (SceneFake). A manipulated audio clip in the SceneFake dataset involves only tampering with the acoustic scene of an utterance by using speech enhancement technologies. We can not only detect fake utterances on a seen test set but also evaluate the generalization of fake detection models to unseen manipulation attacks. Some benchmark results are reported on the SceneFake dataset. Besides, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented on the dataset. The results show that scene-manipulated utterances cannot be reliably detected by the existing baseline models of ASVspoof 2019. Furthermore, the detection of unseen scene-manipulation audio remains challenging.
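The dataset construction details are beyond this summary; as a hedged sketch of the scene-manipulation idea (the `enhance` denoiser is a stand-in for a real speech-enhancement model, and the pipeline shape is assumed), one can strip the original scene and re-mix a different scene at a target signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, scene: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `scene` so the speech-to-scene power ratio equals snr_db, then mix.
    Derivation: snr_db = 10*log10(P_speech / (scale^2 * P_scene))."""
    p_speech = np.mean(speech ** 2)
    p_scene = np.mean(scene ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_scene * 10 ** (snr_db / 10)))
    return speech + scale * scene[: len(speech)]

def manipulate_scene(original: np.ndarray, new_scene: np.ndarray,
                     enhance, snr_db: float = 10.0) -> np.ndarray:
    """`enhance` is a hypothetical speech-enhancement function that removes
    the original acoustic scene; the manipulated audio keeps the utterance
    but carries a forged scene."""
    clean = enhance(original)
    return mix_at_snr(clean, new_scene, snr_db)
```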
Partial label learning (PLL) is a typical weakly supervised learning paradigm, where each sample is associated with a set of candidate labels. The basic assumption of PLL is that the ground-truth label must reside in the candidate set. However, this assumption may not be satisfied due to unprofessional judgment by the annotators, thus limiting the practical application of PLL. In this paper, we relax this assumption and focus on a more general problem, noisy PLL, where the ground-truth label may not exist in the candidate set. To address this challenging problem, we propose a novel framework called "Automatic Refinement Network (ARNet)". Our method consists of multiple rounds. In each round, we purify the noisy samples through two key modules, i.e., noisy sample detection and label correction. To guarantee the performance of these modules, we start with warm-up training and automatically select the appropriate correction epoch. Meanwhile, we exploit data augmentation to further reduce prediction errors in ARNet. Through theoretical analysis, we prove that our method is able to reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. To verify the effectiveness of ARNet, we conduct experiments on multiple benchmark datasets. Experimental results demonstrate that ARNet is superior to existing state-of-the-art approaches on noisy PLL. Our code will be made public soon.
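The module internals are not specified in this summary; a hypothetical sketch of one purification round (the high-loss detection rule and the prediction-based correction are assumptions, not ARNet's actual criteria) is:

```python
import numpy as np

def purify_round(losses: np.ndarray, candidate_masks: np.ndarray,
                 probs: np.ndarray, quantile: float = 0.9):
    """One assumed round: flag high-loss samples as noisy, then replace
    their candidate sets with the model's current prediction.

    losses:          (n,) per-sample training loss
    candidate_masks: (n, num_classes) 0/1 candidate-label indicators
    probs:           (n, num_classes) model posterior for each sample
    """
    noisy = losses > np.quantile(losses, quantile)   # noisy sample detection
    corrected = candidate_masks.copy()
    pred = probs.argmax(axis=1)
    for i in np.where(noisy)[0]:
        corrected[i] = 0
        corrected[i, pred[i]] = 1                    # label correction
    return corrected, noisy
```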
Heterogeneous comprehensive learning particle swarm optimization (HCLPSO) is an evolutionary algorithm with enhanced exploration and exploitation capabilities. Compared with random sequences, low-discrepancy sequences (LDS) cover the search space more uniformly. In this paper, we study how to exploit the good uniformity of LDS to improve HCLPSO. Numerical experiments show that merely using an LDS to generate the initial population cannot effectively improve the search ability of HCLPSO. However, if we properly select some of the random sequences in the HCLPSO velocity-update formula and replace them with deterministic LDS, a more efficient algorithm can be obtained. Compared with the original HCLPSO under the same accuracy requirement, the HCLPSO that updates velocities with deterministic LDS can significantly reduce the number of iterations needed to find the optimal solution, without reducing the success rate.
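As a hedged illustration (a generic PSO velocity update is shown, not HCLPSO's full exemplar-based formula, and the coefficient values are assumptions), the replacement amounts to drawing the random factors from a scrambled Sobol sequence instead of a uniform RNG:

```python
import numpy as np
from scipy.stats import qmc

def velocity_update(v, x, pbest, gbest, sobol: qmc.Sobol,
                    w: float = 0.7, c1: float = 1.5, c2: float = 1.5):
    """Generic PSO velocity update with LDS-driven random factors.
    HCLPSO's actual formula uses per-particle exemplars; this simplified
    form only illustrates where the deterministic LDS is substituted."""
    lds = sobol.random(2)      # two low-discrepancy draws in [0, 1)^dim
    r1, r2 = lds[0], lds[1]
    return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)

# Usage: one particle in a 4-dimensional search space
sobol = qmc.Sobol(d=4, scramble=True)
v = np.zeros(4)
x = np.random.rand(4)
pbest = np.random.rand(4)
gbest = np.random.rand(4)
v = velocity_update(v, x, pbest, gbest, sobol)
```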
Open-world object detection, a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all category names of detection datasets into sentences, which leads to inefficient interaction among category names. This paper presents DetCLIP, a paralleled visual-concept pre-training method for open-world detection that resorts to knowledge enrichment from a designed concept dictionary. To achieve better learning efficiency, we propose a novel paralleled concept formulation that extracts concepts separately to better utilize heterogeneous datasets (i.e., detection, grounding, and image-text pairs) for training. We further design a concept dictionary (with descriptions) drawn from various online sources and detection datasets to provide prior knowledge for each concept. By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning. The proposed concept dictionary is further used to provide sufficient negative concepts for the construction of the word-region alignment loss, and to complete labels for objects whose descriptions are missing in the captions of image-text pair data. The proposed framework demonstrates strong zero-shot detection performance; e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared with the fully supervised model with the same backbone as ours.
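The exact loss is not reproduced in this summary; as a hypothetical sketch (the temperature, the cosine-similarity scoring, and the cross-entropy form are assumptions), a word-region alignment loss scores each region against positive and dictionary-sampled negative concept embeddings:

```python
import torch
import torch.nn.functional as F

def word_region_alignment_loss(region_feats: torch.Tensor,
                               concept_embs: torch.Tensor,
                               targets: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """Assumed sketch: cosine similarities between region features and
    concept embeddings (positives plus negatives sampled from the concept
    dictionary), trained with cross-entropy over the concept axis.

    region_feats: (num_regions, dim) pooled region features
    concept_embs: (num_concepts, dim) text embeddings of concept names
    targets:      (num_regions,) index of each region's ground-truth concept
    """
    regions = F.normalize(region_feats, dim=-1)
    concepts = F.normalize(concept_embs, dim=-1)
    logits = regions @ concepts.t() / tau
    return F.cross_entropy(logits, targets)
```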